Apply PDP to the regression example of predicting bike rentals. Fit a random forest approximation for the prediction of bike rentals (cnt). Use the partial dependence plot to visualize the relationships the model learned. Use the slides shown in class as model.
Analyse the influence of days since 2011, temperature, humidity and wind speed on the predicted bike counts.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(reshape2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 4.1.3
##
## randomForestSRC 3.1.0
##
## Type rfsrc.news() to see new features, changes, and bug fixes.
##
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(ggplot2)
#setwd("/Users/cmonserr/OneDrive - UPV/Trabajo_2/Asignaturas/Evaluacion de modelos/Practicas/Practica 3/Bike-Sharing-Dataset")
days <- read.csv("day.csv")
hour <- read.csv("hour.csv")
days$dteday <- as_date(days$dteday)
days_since <- select(days, workingday, holiday, temp, hum, windspeed, cnt)
days_since$days_since_2011 <- int_length(interval(ymd("2011-01-01"), days$dteday)) / (3600*24)
days_since$SUMMER <- ifelse(days$season == 3, 1, 0)
days_since$FALL <- ifelse(days$season == 4, 1, 0)
days_since$WINTER <- ifelse(days$season == 1, 1, 0)
days_since$MISTY <- ifelse(days$weathersit == 2, 1, 0)
days_since$RAIN <- ifelse(days$weathersit == 3 | days$weathersit == 4, 1, 0)
days_since$temp <- days_since$temp * 47 - 8
days_since$hum <- days_since$hum * 100
days_since$windspeed <- days_since$windspeed * 67
rf <- rfsrc(cnt~., data=days_since, importance=TRUE)
results <- select(days_since, days_since_2011, temp, hum, windspeed, cnt)
nr <- nrow(days_since)
for(c in names(results)[1:4])
{
for(i in 1:nr){
r <- days_since
r[[c]] <- days_since[[c]][i]
sal <- predict(rf, r)$predicted
results[[c]][i] <- sum(sal) / nr
}
}
Generate a 2D Partial Dependency Plot with humidity and temperature to predict the number of bikes rented depending of those parameters.
BE CAREFUL: due to the size, extract a set of random samples from the BBDD before generating the the data for the Partial Dependency Plot.
Show the density distribution of both input features with the 2D plot as shown in the class slides.
TIP: Use geom_tile() to generate the 2D plot. Set width and height to avoid holes.
Interpret the results.
sampled <- sample_n(days_since, 40)
temp <- sampled$temp
hum <- sampled$hum
th <- inner_join(data.frame(temp),data.frame(hum), by=character())
th$p <- 0
for(i in 1:nrow(th)){
r <- days_since
r[["temp"]] <- th[["temp"]][i]
r[["hum"]] <- th[["hum"]][i]
sal <- predict(rf, r)$predicted
th[["p"]][i] <- sum(sal) / nr
}
Apply the previous concepts to predict the price of a house from the database kc_house_data.csv. In this case, use again a random forest approximation for the prediction based on the features bedrooms, bathrooms, sqft_living, sqft_lot, floors and yr_built. Use the partial dependence plot to visualize the relationships the model learned.
BE CAREFUL: due to the size, extract a set of random samples from the BBDD before generating the data for the Partial Dependency Plot.
Analyse the influence of bedrooms, bathrooms, sqft_living and floors on the predicted price.
d <- read.csv("kc_house_data.csv")
sampled <- sample_n(d, 1000)
sampled <- select(sampled, bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, price)
rf <- rfsrc(price~., data=sampled)
results <- select(sampled, bedrooms, bathrooms, sqft_living, floors, price)
nr <- nrow(sampled)
for(c in names(results)[1:4]){
for(i in 1:nr){
r <- sampled
r[[c]] <- sampled[[c]][i]
sal <- predict(rf, r)$predicted
results[[c]][i] <- sum(sal) / nr
}
}
p1 <- ggplot(data = sampled, aes(x=bedrooms, y=results$bedrooms)) + geom_line() +
geom_rug(alpha=0.1, sides="b")+ labs(x="Bedrooms", y="Prediction")
p2 <- ggplot(data = sampled, aes(x=bathrooms, y=results$bathrooms)) + geom_line() +
geom_rug(alpha=0.1, sides="b") + labs(x="Bathrooms", y=NULL)+ xlim(0,5)
p3 <- ggplot(sampled, aes(x=sqft_living , y = results$sqft_living)) + geom_line() + geom_rug(alpha=0.1, sides="b") + labs(x="Sqft_living", y=NULL)
p4 <- ggplot(data = sampled, aes(x=floors, y=results$floors)) + geom_line() +
geom_rug(alpha=0.1, sides="b") + labs(x="Floors", y=NULL)
subplot(p1,p2,p3,p4, titleY=T, titleX=T)
Dormitorios: Podemos ver como el precio aumenta cuando se pasa de 1 a 2 dormitorios, lo que es lógico que sea más caro por contar con más dormitorios. Pero por otro lado vemos como el precio disminuye hasta los 4 dormitorios y de nuevo vuelve a aumentar lo que no parece tener un sentido tan claro. Baños: Vemos como el precio aumenta a medida que aumenta el número de baños. Parece que de 4 a 5 baños el precio disminuye, pero la diferencia es mínima. Esto tiene sentido ya que una casa con mas baños tiene más valor por regla general, porque suele ser una vivienda de un nivel mayor. Metros de la vivienda: A media que aumentan los metros en una vivienda sube su precio considerablemente, lo que es lógico ya que una casa más grande tiene un precio mayor. Piso: A medida que aumenta el número de piso aumenta el valor de la vivienda. Esto es normal porque los pisos con mayor altura suelen tener precios mayores ya que a la gente le suele gustar más, por vistas, ausencia de ruidos de la calle, seguridad, etc.